So far, we have been slowly building intuition about inferential models. Now we turn to the data itself. In particular, we often need to preprocess the data so it is in a form the model can use, which may involve creating new variables. This is also known as feature engineering. Again, think of preprocessing as data wrangling for models.
We’ve technically already done some preprocessing: data splitting.
# Load packages.
library(tidyverse)
library(tidymodels)
# Import soup data.
soup_data <- read_csv(here::here("Data", "soup_data.csv"))
# Split data.
soup_split <- initial_time_split(soup_data, prop = 0.90)
Preprocessing data is made easier using the recipes package, where common preprocessing tasks have been broken down into steps. Every recipe begins with recipe(), where we identify the model outcome and explanatory variables (a.k.a., predictors) using formula notation.
soup_recipe <- training(soup_split) |>
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category)

soup_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
A common problem is right-skewed continuous data, something to identify while doing exploratory data analysis.
Why is right-skewed continuous data common?
# Visualize Sales.
soup_data |>
  ggplot(aes(x = Sales)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It’s not just Sales.
# Visualize all continuous variables.
soup_data |>
pivot_longer(
c(Sales, Any_Disp_Spend, Any_Price_Decr_Spend),
names_to = "var_names",
values_to = "cont_values"
) |>
ggplot(aes(x = cont_values)) +
geom_histogram() +
facet_wrap(~ var_names)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 704 rows containing non-finite outside the scale range
## (`stat_bin()`).
soup_data |>
  filter(is.na(Sales) | is.na(Any_Disp_Spend) | is.na(Any_Price_Decr_Spend)) |>
  select(Sales, Any_Disp_Spend, Any_Price_Decr_Spend)
## # A tibble: 352 × 3
##    Sales Any_Disp_Spend Any_Price_Decr_Spend
##    <dbl>          <dbl>                <dbl>
##  1    33             NA                   NA
##  2  6413             NA                   NA
##  3    30             NA                   NA
##  4  1182             NA                   NA
##  5   415             NA                   NA
##  6   107             NA                   NA
##  7   163             NA                   NA
##  8   669             NA                   NA
##  9   132             NA                   NA
## 10    51             NA                   NA
## # ℹ 342 more rows
The linear model assumes normally distributed errors, which means our outcome Y should be approximately normal (or at least as close as we can get it)! We can preprocess right-skewed continuous data by using a log() (i.e., logarithmic) transform.
log() transform? What is a logarithm?

# Use the log() transform.
soup_data |>
pivot_longer(
c(Sales, Any_Disp_Spend, Any_Price_Decr_Spend),
names_to = "var_names",
values_to = "cont_values"
) |>
mutate(log_cont_values = log(cont_values + 1)) |>
ggplot(aes(x = log_cont_values)) +
geom_histogram() +
facet_wrap(~ var_names, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 704 rows containing non-finite outside the scale range
## (`stat_bin()`).
We add this step to our recipe using step_log(). We can reference all continuous variables in the model with all_numeric(). The argument offset = 1 says to add 1 before using log() so we can avoid log(0).
soup_recipe <- training(soup_split) |>
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |>
  step_log(all_numeric(), offset = 1)

soup_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
##
## ── Operations
## • Log transformation on: all_numeric()
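To see the arithmetic this step performs, here is a quick base-R sketch (no recipes required) showing why the offset keeps zeros finite:

```r
# The transform applied by step_log(all_numeric(), offset = 1) is log(x + 1).
x <- c(0, 9, 99)

# Adding the offset first means zeros map to log(1) = 0 instead of -Inf.
log_x <- log(x + 1)
log_x
#> [1] 0.000000 2.302585 4.605170

# Base R's log1p() computes the same quantity (more accurately for tiny x).
all.equal(log_x, log1p(x))
#> [1] TRUE
```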
How many binary variables will we have when we dummy code this discrete variable?
soup_data |> count(Sub_Category)
## # A tibble: 5 × 2
##   Sub_Category       n
##   <chr>          <int>
## 1 CONDENSED SOUP   410
## 2 DRY SOUP        1194
## 3 RAMEN            217
## 4 RTS SOUP        1310
## 5 SOUP IN OUT       61
We add this step to our recipe using step_dummy(). We could reference all discrete variables in the model with all_nominal().
soup_recipe <- training(soup_split) |>
  recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |>
  step_log(all_numeric(), offset = 1) |>
  step_dummy(Sub_Category)

soup_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 3
##
## ── Operations
## • Log transformation on: all_numeric()
## • Dummy variables from: Sub_Category
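Dummy coding isn't special to recipes; base R's model.matrix() produces the same k − 1 encoding, which is a handy way to check what step_dummy() will create. A minimal sketch using the five sub-category names from the count() output above:

```r
# Five sub-categories, as in soup_data.
sub_cat <- factor(
  c("CONDENSED SOUP", "DRY SOUP", "RAMEN", "RTS SOUP", "SOUP IN OUT")
)

# model.matrix() drops the first (reference) level, leaving k - 1 = 4 dummies.
dummies <- model.matrix(~ sub_cat)[, -1]  # drop the intercept column
ncol(dummies)
#> [1] 4
```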
We can build the recipe all at once. Once we’re ready, we can prep() the recipe using the training data. The resulting object is a complete recipe that we can apply to whatever data we’d like.
soup_recipe <- training(soup_split) |> recipe(Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category) |> step_log(all_numeric(), offset = 1) |> step_dummy(Sub_Category) |> prep()
Following the recipe analogy, we apply a prepared recipe to data using bake().
# Apply the recipe to the training data.
baked_soup_training <- soup_recipe |>
  bake(training(soup_split))

# Apply the recipe to the testing data.
baked_soup_testing <- soup_recipe |>
  bake(testing(soup_split))
baked_soup_training
## # A tibble: 2,872 × 7
##    Any_Disp_Spend Any_Price_Decr_Spend Sales Sub_Category_DRY.SOUP
##             <dbl>                <dbl> <dbl>                 <dbl>
##  1           0                    0     6.73                     0
##  2           9.71                10.8  13.1                      0
##  3           0                    0     4.52                     0
##  4           0                    7.58 11.8                      0
##  5           0                    0     3.93                     1
##  6           0                    0    10.2                      1
##  7           0                    0     4.91                     1
##  8           0                    0     0                        1
##  9           2.30                 0     8.74                     1
## 10           0                    0     1.61                     1
## # ℹ 2,862 more rows
## # ℹ 3 more variables: Sub_Category_RAMEN <dbl>, Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>
baked_soup_testing
## # A tibble: 320 × 7
##    Any_Disp_Spend Any_Price_Decr_Spend Sales Sub_Category_DRY.SOUP
##             <dbl>                <dbl> <dbl>                 <dbl>
##  1           0                    0     7.61                     0
##  2           0                    0     3.50                     0
##  3           0                    0     5.12                     0
##  4           5.92                 6.65 11.1                      0
##  5           0                    0     0                        0
##  6           0                    9.33  9.95                     0
##  7           0                    0     7.97                     1
##  8           0                    0     6.00                     1
##  9           5.63                 6.80  6.96                     1
## 10           0                    8.58  9.18                     1
## # ℹ 310 more rows
## # ℹ 3 more variables: Sub_Category_RAMEN <dbl>, Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>
Let’s compare models with and without the preprocessed data.
# Fit a model without preprocessed data.
fit_raw <- linear_reg() |>
set_engine("lm") |>
fit(
Sales ~ Any_Disp_Spend + Any_Price_Decr_Spend + Sub_Category,
data = training(soup_split)
)
# Fit a model with preprocessed data.
fit_baked <- linear_reg() |>
set_engine("lm") |>
fit(
Sales ~ .,
data = baked_soup_training
)
If we want it, there is also a horizontal version of the error bars plot, which can be easier to read when a model has many terms.
# Compare parameter estimates.
tidy(fit_baked, conf.int = TRUE) |>
  ggplot(aes(y = term)) +
  geom_point(aes(x = estimate)) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = .1) +
  geom_vline(xintercept = 0, color = "red")
\[\log(\text{Sales} + 1) = 5.42 + 0.12 \times \log(\text{Any\_Disp\_Spend} + 1) + \\ \dots + 0.18 \times \text{Sub\_Category\_RTS.SOUP} + \dots\]
How does the log transform change parameter interpretations?
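One hedged answer, using the 0.12 display-spend slope from the equation above: when both the outcome and the predictor are logged, a slope is approximately an elasticity, so a 1% increase in spend is associated with roughly a 0.12% increase in Sales. The arithmetic, in base R:

```r
# In a log-log model, a slope is (approximately) an elasticity.
beta <- 0.12  # the Any_Disp_Spend slope from the fitted equation above

# A 1% increase in display spend multiplies (Sales + 1) by 1.01^beta,
# i.e., roughly a beta-percent increase in Sales.
pct_change <- (1.01^beta - 1) * 100
round(pct_change, 3)
#> [1] 0.119
```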
bind_rows(
# Compute RMSE without preprocessed testing data.
predict(fit_raw, new_data = testing(soup_split)) |>
bind_cols(testing(soup_split)) |>
rmse(truth = Sales, estimate = .pred),
# Compute RMSE with preprocessed testing data.
predict(fit_baked, new_data = baked_soup_testing) |>
bind_cols(baked_soup_testing) |>
rmse(truth = Sales, estimate = .pred)
)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      41040.
## 2 rmse    standard        2.36
These two RMSEs aren't directly comparable: the second model predicts log(Sales + 1), so its errors are on the log scale. To compare fairly, back-transform its predictions to the original dollar scale.
bind_rows(
# Compute RMSE without preprocessed testing data.
predict(fit_raw, new_data = testing(soup_split)) |>
bind_cols(testing(soup_split)) |>
rmse(truth = Sales, estimate = .pred),
# Compute RMSE for predictions from preprocessed testing data...using the raw Y values
predict(fit_baked, new_data = baked_soup_testing) |>
bind_cols(testing(soup_split)) |>
mutate(.pred = exp(.pred)) |>
rmse(truth = Sales, estimate = .pred)
)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      41040.
## 2 rmse    standard      39296.
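One detail worth noting about the back-transform: because the recipe logged Sales with offset = 1, the exact inverse is exp(y) − 1 rather than exp(y) alone (a difference of one dollar, which is negligible at this sales scale). A quick base-R check using a few Sales values from the raw data:

```r
# The recipe stores log(Sales + 1), so the exact inverse is exp(y) - 1.
sales <- c(0, 839, 468607)  # values taken from the raw soup_data printout
y <- log(sales + 1)

# Undoing both the log and the offset recovers the original dollar values.
all.equal(exp(y) - 1, sales)
#> [1] TRUE
```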
Wait, there’s more! Having all of the preprocessing steps saved as a recipe means we can easily apply them to new_data when using predict() to evaluate counterfactuals. Construct the counterfactual data in the same form as the raw data so that it will work with our recipe.
soup_data |> select(Any_Disp_Spend, Any_Price_Decr_Spend, Sub_Category, Sales)
## # A tibble: 3,192 × 4
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category    Sales
##             <dbl>                <dbl> <chr>           <dbl>
##  1              0                    0 CONDENSED SOUP    839
##  2          16489                48977 CONDENSED SOUP 468607
##  3              0                    0 CONDENSED SOUP     91
##  4              0                 1954 CONDENSED SOUP 139993
##  5              0                    0 DRY SOUP           50
##  6              0                    0 DRY SOUP        27456
##  7              0                    0 DRY SOUP          135
##  8              0                    0 DRY SOUP            0
##  9              9                    0 DRY SOUP         6257
## 10              0                    0 DRY SOUP            4
## # ℹ 3,182 more rows
Allocate $10,000 spend between display and price decreases for each sub-category.
# Counterfactual scenarios.
scenarios <- tibble(
  # Display spend amounts, repeated five times: once for each sub-category.
  Any_Disp_Spend = seq(from = 0, to = 10000, by = 2000) |> rep(5),
  # Price decrease spend amounts, repeated five times: once for each sub-category.
  Any_Price_Decr_Spend = seq(from = 10000, to = 0, by = -2000) |> rep(5),
  # Sub-category names, repeated six times since there are six counterfactual
  # scenarios in each sub-category.
  Sub_Category = unique(soup_data$Sub_Category) |> rep(6) |> sort(),
  # Placeholder outcome so the recipe can be applied.
  Sales = 1
)
What’s different about this new_data? How can we interpret each of the rows?
scenarios
## # A tibble: 30 × 4
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category   Sales
##             <dbl>                <dbl> <chr>          <dbl>
##  1              0                10000 CONDENSED SOUP     1
##  2           2000                 8000 CONDENSED SOUP     1
##  3           4000                 6000 CONDENSED SOUP     1
##  4           6000                 4000 CONDENSED SOUP     1
##  5           8000                 2000 CONDENSED SOUP     1
##  6          10000                    0 CONDENSED SOUP     1
##  7              0                10000 DRY SOUP           1
##  8           2000                 8000 DRY SOUP           1
##  9           4000                 6000 DRY SOUP           1
## 10           6000                 4000 DRY SOUP           1
## # ℹ 20 more rows
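The rep()/sort() bookkeeping is easy to get wrong, so as a sanity check, the same 30-row grid can be sketched with base R's expand.grid(), which crosses every spend split with every sub-category (sub-category names taken from the count() output earlier):

```r
# Cross 6 spend splits with 5 sub-categories: 30 counterfactual rows.
sub_cats <- c("CONDENSED SOUP", "DRY SOUP", "RAMEN", "RTS SOUP", "SOUP IN OUT")
scenarios_base <- expand.grid(
  Any_Disp_Spend = seq(0, 10000, by = 2000),
  Sub_Category   = sub_cats,
  stringsAsFactors = FALSE
)

# The price-decrease spend is whatever remains of the $10,000 budget.
scenarios_base$Any_Price_Decr_Spend <- 10000 - scenarios_base$Any_Disp_Spend

nrow(scenarios_base)
#> [1] 30

# Every row allocates the full budget.
all(scenarios_base$Any_Disp_Spend + scenarios_base$Any_Price_Decr_Spend == 10000)
#> [1] TRUE
```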
Remove the outcome and apply the recipe before using the new data for predict().
# Apply the recipe to the counterfactual scenarios.
baked_scenarios <- soup_recipe |>
  bake(scenarios) |>
  select(-Sales)

baked_scenarios
## # A tibble: 30 × 6
##    Any_Disp_Spend Any_Price_Decr_Spend Sub_Category_DRY.SOUP Sub_Category_RAMEN
##             <dbl>                <dbl>                 <dbl>              <dbl>
##  1           0                    9.21                     0                  0
##  2           7.60                 8.99                     0                  0
##  3           8.29                 8.70                     0                  0
##  4           8.70                 8.29                     0                  0
##  5           8.99                 7.60                     0                  0
##  6           9.21                 0                        0                  0
##  7           0                    9.21                     1                  0
##  8           7.60                 8.99                     1                  0
##  9           8.29                 8.70                     1                  0
## 10           8.70                 8.29                     1                  0
## # ℹ 20 more rows
## # ℹ 2 more variables: Sub_Category_RTS.SOUP <dbl>,
## #   Sub_Category_SOUP.IN.OUT <dbl>
Generate point predictions and predictive intervals.
# Predict and bind on prediction intervals.
bind_cols(
  predict(fit_baked, new_data = baked_scenarios),
  predict(fit_baked, new_data = baked_scenarios, type = "pred_int"),
  baked_scenarios
) |>
  arrange(desc(.pred))
## # A tibble: 30 × 9
##    .pred .pred_lower .pred_upper Any_Disp_Spend Any_Price_Decr_Spend
##    <dbl>       <dbl>       <dbl>          <dbl>                <dbl>
##  1  11.0        7.21        14.9           7.60                 8.99
##  2  11.0        7.13        14.8           8.29                 8.70
##  3  10.8        7.00        14.7           7.60                 8.99
##  4  10.8        6.96        14.6           8.70                 8.29
##  5  10.8        6.93        14.6           8.29                 8.70
##  6  10.6        6.76        14.4           8.70                 8.29
##  7  10.5        6.69        14.4           7.60                 8.99
##  8  10.5        6.63        14.3           8.99                 7.60
##  9  10.5        6.61        14.3           8.29                 8.70
## 10  10.3        6.46        14.1           0                    9.21
## # ℹ 20 more rows
## # ℹ 4 more variables: Sub_Category_DRY.SOUP <dbl>, Sub_Category_RAMEN <dbl>,
## #   Sub_Category_RTS.SOUP <dbl>, Sub_Category_SOUP.IN.OUT <dbl>
Everything a step function does we could wrangle manually, but step functions can be a lot easier to use. Beyond step_log() and step_dummy(), for example:
- step_corr() to check for multicollinearity and drop explanatory variables.
- step_normalize() to subtract the mean and divide by the standard deviation.
- step_discretize() to make a continuous variable discrete.
- step_num2factor() to quickly turn a numeric variable into a factor.

We also have additional helper functions like all_predictors().
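To see what step_normalize() computes, here is the arithmetic in base R (subtract the mean, divide by the standard deviation), which is also what base R's scale() does:

```r
x <- c(2, 4, 6, 8, 10)

# Center and scale: subtract the mean, divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# The result has mean 0 and standard deviation 1.
round(mean(z), 10)
#> [1] 0
sd(z)
#> [1] 1

# Base R's scale() computes the same values.
all.equal(as.numeric(scale(x)), z)
#> [1] TRUE
```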
Summary
Next Time
Supplementary Material
Artwork by @allison_horst
Return again to soup_data and your models used in Exercises 11 and 12.
Predict Sales using three possible ways to split $6000 for the promotions budget. As with the training and testing data, be sure to preprocess the new data. What should the client do with their promotional budget? Provide an interpretation using dollars.